Improved Sentence-Level Arabic Dialect Classification
نویسندگان
چکیده
The paper presents work on improved sentence-level dialect classification of Egyptian Arabic (ARZ) vs. Modern Standard Arabic (MSA). Our approach is based on binary feature functions that can be implemented with a minimal amount of task-specific knowledge. We train a featurerich linear classifier based on a linear support-vector machine (linear SVM) approach. Our best system achieves an accuracy of 89.1 % on the Arabic Online Commentary (AOC) dataset (Zaidan and Callison-Burch, 2011) using 10-fold stratified cross validation: a 1.3 % absolute accuracy improvement over the results published by (Zaidan and Callison-Burch, 2014). We also evaluate the classifier on dialect data from an additional data source. Here, we find that features which measure the informalness of a sentence actually decrease classification accuracy significantly.
منابع مشابه
Arabic Dialect Identification
The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a nontrivial manner from the various spoken regional dialects of Arabic – the true “native languages” of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. In this article, we des...
متن کاملImproved Arabic Dialect Classification with Social Media Data
Arabic dialect classification has been an important and challenging problem for Arabic language processing, especially for social media text analysis and machine translation. In this paper we propose an approach to improving Arabic dialect classification with semi-supervised learning: multiple classifiers are trained with weakly supervised, strongly supervised, and unsupervised data. Their comb...
متن کاملSentence Level Dialect Identification in Arabic
This paper introduces a supervised approach for performing sentence level dialect identification between Modern Standard Arabic and Egyptian Dialectal Arabic. We use token level labels to derive sentence-level features. These features are then used with other core and meta features to train a generative classifier that predicts the correct label for each sentence in the given input text. The sy...
متن کاملArabic Dialect Identification Using a Parallel Multidialectal Corpus
We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA) – the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization – a method not previously a...
متن کاملAIDA2: A Hybrid Approach for Token and Sentence Level Dialect Identification in Arabic
In this paper, we present a hybrid approach for performing token and sentence levels Dialect Identification in Arabic. Specifically we try to identify whether each token in a given sentence belongs to Modern Standard Arabic (MSA), Egyptian Dialectal Arabic (EDA) or some other class and whether the whole sentence is mostly EDA or MSA. The token level component relies on a Conditional Random Fiel...
متن کامل